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Abstract 



Inference for partially observed Markov process models has been a longstanding methodolog- 
ical challenge with many scientific and engineering applications. Iterated filtering algorithms 
maximize the likelihood function for partially observed Markov process models by solving a 
recursive sequence of filtering problems. We present new theoretical results pertaining to the 
1^ ' convergence of iterated filtering algorithms implemented via sequential Monte Carlo filters. This 

, theory complements the growing body of empirical evidence that iterated filtering algorithms 

provide an effective inference strategy for scientific models that have been intractable via alter- 
native methodologies. 

^ ■ 1 Introduction 

■ Partially observed Markov processes are of widespread importance throughout science and engineer- 

ing. As such, they have been studied under various names including state space models (Durbin and 
I Koopman, 2001), dynamic models (West and Harrison, 1997) and hidden Markov models (Cappe 

' et al., 2005). Applications include ecology (Newman et al., 2008), economics (Fernandez- Villaverde 

O ■ and Rubrio-Ramirez, 2007), epidemiology (King et al., 2008), finance (Johannes et al., 2009), mete- 

orology (Anderson and Collins, 2007), neuroscience (Ergun et al., 2007) and target tracking (Godsill 
et al., 2007). One central and extensively studied issue is the reconstruction of unobserved com- 
ponents of the Markov process from the available observations. Reconstructing the current state 
: of the process (i.e., determining or approximating its conditional distribution given all current and 

previous observations) is known as filtering (Anderson and Moore, 1979; Arulampalam et al., 2002). 
Oftentimes one also wishes to draw inferences on unknown model parameters from data; we call 
these static parameters when we wish to distinguish them from the time- varying components of the 
Markov process. 

A successful numerical solution to the filtering problem enables evaluation of the likelihood 
function and therefore brings one tantalizingly close to efficient estimation of static parameters 
via likelihood-based approaches, either Bayesian or non-Bayesian. However, numerical instabilities 
typically arise which have inspired a considerable literature (Kitagawa, 1998; Liu and West, 2001; 
Storvik, 2002; lonides et al., 2006; Toni et al., 2008; Poison et al., 2008). As a generalization, the 
numerical complications derive from difficulties maximizing or numerically integrating a computa- 
tionally intensive approximation to the likelihood function with the possible additional concern of 



1 



Monte Carlo variability. In this article, we investigate approaches which itcratively carry out a fil- 
tering procedure to explore the likelihood surface at increasingly local scales in search of a maximum 
of the likelihood function. Such methodology, called iterated filtering, has been shown capable of 
addressing state-of-the-art inference challenges (lonides et al., 2006; King et al., 2008; Breto et al., 
2009; He et al., 2009). In Section 2 we develop new theoretical results for iterated filtering. The 
previous theoretical foundation for iterated filtering, presented by lonides et al. (2006), did not 
engage directly in the Monte Carlo issues relating to practical implementation of the methodology. 
It is relatively easy to check that a (local) maximum has been attained, and therefore one can view 
the theory of lonides et al. (2006) as motivation for an algorithm whose capabilities were proven 
by demonstration. However, the more complete theory presented here gives additional insights into 
the capabilities, limitations and practical implementation of iterated filtering. 

Not all estimation techniques for static parameters are based on solving the filtering problem. 
Stochastic expectation-maximization and Markov Chain Monte Carlo approaches (Cappe et al., 
2005) work instead with the conditional distribution of entire trajectories of the Markov process 
given the observations. Some estimation methods entirely avoid explicit computation of these con- 
ditional distributions (Kendall et al., 1999; Reuman et al., 2006; Toni et al., 2008) but thereby lose 
the generality and statistical efficiency of likelihood-based inference frameworks. Further discussion 
of the relative merits of iterated filtering compared to other available methodology is postponed to 
the discussion in Section 3. Proofs of the theorems stated in Section 2 are given in Section 4. 

2 Notation and main results 

Let {x{t),t € T} be a Markov process taking values in R'^^ (Rogers and Williams, 1994). The 
time index set T C M may be an interval or a discrete set, but we are primarily concerned with 
a finite subset of times < t2 < • • • < tAr at which x{t) is observed, together with some initial 
time to < ti. We write xo:n = {xq, . . . ,xn) = {x{to), . . . ,x{tN))- We correspondingly denote the 
observation process by yi^N = (yi, . . . , yjv), with y„ taking a value in R'^y . We assume the existence 
of all required joint and conditional densities for xo:N and yi:N- These densities are supposed to 
depend on an unknown parameter vector 9 taking a value in W^^. A partially observed Markov 
model may then be specified at times touV by an initial density /(.tqI^), conditional transition 
densities /(.t„ | .t„_i, (?) for 1 < n < A^, and the conditional densities of the observation process 
which have the form /(?/„ | yi-.n-i, xi:n, 0) = f{yn \ Xn, 0). This notation overloads /(• | •) as a generic 
density which is specified by its arguments, and suppresses the distinction between random variables 
and their realizations. The generic function / gives a notationally compact representation of a 
partially observed Markov model, which can be rigorously formalized. Appendix A provides further 
discussion of generic function notation. 

Iterated filtering involves introducing a sequence of approximations to the model / in which a 
time-varying parameter process {9n,0<n< N} is introduced. Specifically, equations (1-3) define 
a model g for a Markov process ~[(^ni ^n)? ^ ^ A^} and observation process y\\N' Assuming 
/ is continuously parameterized as a function of 9, we see from (1-3) that ^(aJoiAr, yi;Ar | ^, cr, r) 
approaches f{xQ;N-,yi:N \ as both a — > and r — > 0. 



g {Xn 1 9n I Xji—l , 9n-i,9,a,T) 
giVn I Xn,9n,9,a,T) 

g{xo,9o\9,a, r) 




(1) 
(2) 
(3) 
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Here, k is a probability density function on M"^" which specifics a random walk for 9n- Prom (1), the 
increments of the random walk are independent of the current state of the process Xn- We suppose 
that the distribution corresponding to k has mean zero and covariance matrix S, so that 

E[en\en-l,e,a,T]=en-l, YaT{en\en-l,e,a,T)=a^^, (4) 

E[eo\e,a,T]=e, Var(^o|^,(^,r) = r2s. (5) 

Reparameterization of / may be required to ensure that 9 can take all values in W^^. In practice, 
K is typically a multivariate normal density, which must be truncated to meet condition (A4) of 
Theorem 1. Previous theory for iterated filtering (lonides et al., 2006) did not require a scale 
family g{6n \ 6n-i,<7) = a~^K(^a~^[9n — ^n-i]) for the time-varying distribution of However, 
this specification is natural and will be shown to lead to more transparent convergence conditions. 
We refer to a, r, k and S as algorithmic parameters since they play a role in the iterated filtering 
algorithm but are not part of the statistical model specified by /. The choice of algorithmic 
parameters may affect the numerical efficiency of iterated filtering algorithms, but does not affect 
the resulting statistical conclusions. 

We define the log likelihood function to be £{9) = logf{yi:N \ 6)- We write V for a vector of 
partial derivatives with respect to each component of 9, and for the Hessian matrix of second 
partial derivatives. A result underpinning iterated filtering is that V^(^) can be approximated in 
terms of moments of the filtering distributions for g. Specifically, the following Theorem 1 relates 
this derivative to the filtering means and prediction variances for g, defined as 

9^ = 9^{9,a,T)=E[9n\yi:n,e,a,T] 
Vn = V^{9, a, t) = \a.T{9n \ yi-.n-i, e, a,r) 

for n = 1,...,A'^, with 9q = 9. We assume the regularity conditions (A1-A4) below, with | • | 
denoting the absolute value of a vector or the largest absolute eigenvalue of a square matrix. 

(Al) There is a constant Ci{9) such that < /(• | •, 9) < Ci(9) for all joint and conditional densities 

of xq-n and yi-N- Additionally, C\{9) is bounded on compact subsets of W^^ . 
(A2) /(• I -,9) is twice continuously differentiable with respect to 9. Further, 

\^f{yn\Xn,0)\ < GO, {V"^ f {yn\ Xn, 9)\ < GO, 

J |V/(x„+i |a;„,6')| dxn+i < oo, J |V^/(a;„+i | a;„, 6*)! dx„+i < oo, 

J \Vf{xo I 9)\dxo < GO, J \V^f{xo I 9)\dxo < oo, 

with these bounds being uniform over all Xn and y„ with 9 ranging over any compact subset 
of M''^. 

(A3) k{9) is twice continuously differentiable, with J \ V'^k{9) \ d9 < oo. 

(A4) There is a constant C2 with k{9) = for \9\ > C2 and k{9) > for \9\ < C2. 

The conditions (Al) and (A2) axe not restrictive. Conditions (A3) and (A4) can be satisfied by 
the choice of the algorithmic parameters. The assumption of a spherical support for k in (A4) is 
mathematically convenient but we believe this requirement could be relaxed to some more general 
assumption of compact support. 
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Theorem 1. Suppose conditions (A1-A4)- Let u he a function of t with ut -"^ — ^ as r — 0. 
Using notation from (6), 

?-oEK)"'«-d)=V^W. (7) 

n=l 

A proof of Theorem 1 is given in Section 4.1, based on a Taylor series expansion of g{yn \ Ui-.n-i-, &n-> <7, t) 
around On = On-v Theorem 1 is based on a result of lonides et al. (2006), however both the as- 
sumptions employed and the details of the proof differ substantially from this previous work. 

The quantities 9^ and in Theorem 1 do not usually have closed form, and so numerical 
approximations must be made for any practical application of this result. Numerical approximation 
of moments is generally more convenient than approximating derivatives, and this is the reason that 
Theorem 1 may be useful. However, on might suspect that there is no "free lunch" and therefore the 
numerical calculation of the left hand side of (7) should become fragile as a and r becomes small. 
Wc will see that this is indeed the case, but that iterated filtering methods mitigate the difficulty to 
some extent by averaging numerical error over subsequent iterations. To be concrete, we suppose 
henceforth that numerical filtering will be carried out using the basic sequential Monte Carlo 
(SMC) method presented as Algorithm 1. SMC provides a flexible and widely used class of filtering 
algorithms, with many variants designed to improve numerical efficiency (Cappe et al., 2007). 
The relatively simple SMC method in Algorithm 1 is more readily comprehended, analyzed and 
implemented. It has also been found adequate for previous data analyses using iterated filtering 
(lonides et al., 2006; King et al., 2008; Breto et al., 2009; He et al, 2009). We suspect that the 
qualitative conclusions obtained here would apply to variations on Algorithm 1. 



Input: 




parameter vector, tjj 




observations yi;N and times to:Af 




generic density h{- \ •,V') for yi-i^[ and an unobserved process zq;n 


number of particles, J 




Procedure: 




1 initialize filter particles Z^^^ ~ h{zQ | V') for j in 


1 : J 


2 for n in 1 : 




3 for j in 1 : J draw prediction particles Z^- ^ hi^Zn \ Zn-i=Z^_^ •,■)/;) 


4 set w{n, j) = h{yn\ Zn=Z^j,'tp) 




5 draw ki, . . . ,kj such that Prob(A;j=i) = w{n 


,i)IY.iW{n,t) 


6 set Z^. = Zl^^ 




7 end for 





Algorithm 1: A basic sequential Monte Carlo procedure for a discrete-time Markov process {^z^^ 
with generic density function h. In the current context, Zn will be either Xn or {xn,9n) with h 
correspondingly set to / or g respectively. The resampling in step 5 is taken to follow a multinomial 
distribution to build on previous theoretical results making this assumption (Del Moral and Jacod, 
2001; Crisan and Doucet, 2002). An alternative is the systematic procedure in Arulampalam et al. 
(2002, Algorithm 2) which has less Monte Carlo variability; we support the use of systematic 
sampling in practice and we suppose that all our results would continue to hold in such situations. 

To calculate Monte Carlo estimates of the quantities in (6), we apply Algorithm 1 to the 
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model g with Zn = {xn,On), ip = {0,a,T) and J particles. We write Z^j = (X^^- , 0^^) and 
^nj ~ {-^nj ' ®n j) ^^'^ Monte Carlo samples from the filtering and prediction calculations in 
Algorithm 1. Then, using x' to denote the transpose of x, we define 



= ^tf (0, a, r, J) = ^ E/=i (e^,, - (e^,, - ^i)'- 

We now present an analogue to Theorem 1 in which the filtering means and prediction variances 
are replaced by their Monte Carlo counterparts. A proof of this result is given in Section 4.3. 

The stochasticity in Theorem 1 is due to Monte Carlo variability, conditional on the data yi-.N, 
and wc write E and Var to denote Monte Carlo means and variances. The Monte Carlo random 
variables required to implement Algorithm 1 are presumed to be drawn independently each time 
the algorithm is evaluated. 

Theorem 2. Let {am}, {^mj and {Jm} be positive sequences with Tm 0, (JmT^ — > and 
TmJm oo. Define O^^m = 0n{^^(^m,Jm) and V^^m = Vn,m{G,crm,Jm) via (8). Supposing condi- 
tions (A1-A4), 

N 



m— >oo 

n=l 

N 



Jim^T^J„Var(^(l^'J-^(<„-Ci,m)) < oo, (10) 



n=l 



with convergence being uniform for 6 in compact sets. 

Theorem 2 suggests that a Monte Carlo method which leans on Theorem 1 will require a 
sequence of Monte Carlo sample sizes, Jm, which increases faster than r~^. Otherwise, the 
Monte Carlo bias in estimating 9^ — 9^_i, which is of order Tm/ Jm, will eventually dominate the 
information in 9^ — 9^_^ about \/£{9), which is of order r^. Even with TmJm oc, we see from 
(10) that the estimated derivative in (9) may have increasing Monte Carlo variability as m — > oo. 
This trade-off between bias and variance is to be expected in any Monte Carlo numerical derivative, 
a classic example being the Kiefer-Wolfowitz algorithm (Kiefer and Wolfowitz, 1952; Spall, 2003). 
Algorithms which are designed to balance such trade-offs have been extensively studied under the 
label of stochastic approximation (Kushner and Yin, 2003; Spall, 2003). 

Theorem 3 gives an example of a stochastic approximation procedure, defined by the recursive 
sequence 9m in (11). Because each step of this recursion involves an application of the filtering 
procedure in Algorithm 1, we call (11) an iterated filtering algorithm. To prove the convergence 
of this algorithm to a value 9 maximizing the log likelihood function i{9) we make the following 
assumptions, which are standard sufficient conditions for stochastic approximation methods. 

(Bl) Define Z{t) to be a solution to dZ/dt = V£{Z{t)). Suppose that 9 is an asymptotically stable 
equilibrium point, meaning that (i) for every r/ > there exists a 6{ri) such that \Z{t) — 9\<ri 
for alH > whenever \Z{0) — 9\ < 6, and (ii) there exists a 6o such that Z{t) 9 as t ^ oo 
whenever \Z{0) — ^| < 6o. 

(B2) With probability one, sup^ \9m\ < oo. Further, 9m falls infinitely often into a compact subset 
of {Z{0) : limt_oo Z{t) = 9}. 
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Although neither (B1-B2) nor alternative sufficient conditions (Spall, 2003, Chapter 4) are easy to 
verify, stochastic approximation methods have nevertheless been found effective in many situations. 
Condition (B2) is most readily satisfied if 9m is constrained to a neighborhood in which ^ is a 
unique local maximum, which gives a guarantee of local rather than global convergence. Global 
convergence results have been obtained for related stochastic approximation procedures (Maryak 
and Chin, 2008) but are beyond the scope of this current paper. Practical implementation issues 
are discussed in Section 3 below. 

Theorem 3. Let {a„i}, {am}, {Tm} {Jm} be positive SeQUenceS with Tm — ^ 0? ^rn'^m^ — ^ ^' 
JmTm OO; (^m 0, = oo and '^m^m'^m^'''m^ < Specify a recursive sequence of 

parameter estimates {0m} by 

N 
n=l 

where 9^m — ^ni^m, <^rm Jm) d^d V^^m — ^n,rnK^m-, <7mj J-m) o,re defined in (8) via an application of 
Algorithm 1. Assuming conditions (A1-A4) and (B1-B2), lim^^oo = ^ with probability one. 

The proof of Theorem 3, given in Section 4.5, is based on applying Theorem 2 in the context 
of existing results on stochastic approximation. The rate assumptions in Theorem 3 are satisfied, 
for example, by am = m~^, Tm = m"^, am = m~^^'^^^ and Jm = m^"^"*"^/^^ for S > 0. 

3 Discussion of the theory and practice of iterated filtering 

One of the attractive features of the iterated filtering algorithm in Theorem 3 is that it depends on 
the unobserved Markov process only through generation of sample paths. In particular, this proce- 
dure can be implemented in situations where an expression for transition densities is unavailable. 
This property has been called plug-and-play (Breto et al., 2009; He et al., 2009) since it permits 
simulation code to be simply plugged into the inference procedure, enabling scientists to analyze 
multiple alternative models with only minor changes to the computations involved. The iterated 
filtering algorithm in Theorem 3 inherits the plug-and-play property from Algorithm 1. Alternative 
implementations of iterated filtering, for example making use of computationally efficient variations 
on sequential Monte Carlo (Cappe et al., 2007), do not generally possess the plug-and-play property. 

Other plug-and-play methods proposed for partially observed Markov models include artificial 
parameter evolution method (Liu and West, 2001), an approximate Bayesian computation (ABC) 
approach (Toni et al., 2008), and simulation-based forecasting (Kendall et al., 1999). In all these 
methods, statistical efficiency is sacrificed at the altar of computational convenience. For example, 
the stochasticity added for the artificial parameter evolution of Liu and West (2001) dilutes the in- 
fiuence of the earlier observations in the time series; ABC methods work only with a low-dimensional 
summary statistic of the data; non-likelihood-based methods such as least square prediction min- 
imization (Kendall et al., 1999) are also generally statistically efficient. Iterated filtering, on the 
other hand, has the efficiency guarantee of maximum likelihood estimation. Of course, the resulting 
estimates are still subject to Monte Carlo variability due to the infeasibility of attaining the infinite 
limit m — > oo. Ultimately, the value of all such asymptotic theory is dependent on its finite sample 
relevance. 

For challenging numerical computations, there is often a gap between available theorems and 
practical techniques. A classic example of this is optimization by simulated annealing, a popular 
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stochastic optimization technique (Kirkpatrick et al., 1983; Spall, 2003) which draws on physi- 
cal insights from statistical mechanics and mathematical foundations from Markov chain theory. 
Theoretically motivated convergence rates for simulated annealing are often too slow for practical 
implementation while variations on the algorithm with less attractive theory have been found to 
be effective (Ingber, 1993). Although there arc substantial differences between simulated annealing 
and iterated filtering (e.g. global versus local theory, exact versus stochastic objective functions), 
the similarities between these two stochastic search algorithms nevertheless provide a worthwhile 
comparison. To relate simulated annealing and iterated filtering, it is helpful to adopt from simu- 
lated annealing an analogy whereby am and are thought of as temperatures which approaching 
freezing as a„i — > and r ^ 0. If the temperature cools sufficiently slowly, iterated filtering and 
simulated annealing theoretically approach the maximum of their respective target functions. In 
practice, quicker cooling schedules are used for simulated annealing, in which case it is more prop- 
erly called simulated quenching (Ingber, 1993). Periodically increasing the the temperature, by 
chaining together quenched searches, is known as simulated tempering and can lead to a reasonable 
trade-off between investigating fine scale and larger scale structure of the objective function. It is 
generally possible to confirm the success of an optimization procedure by running it from multiple 
widely separated starting points, which makes possible post-hoc validation of the search strategy. 
Our experience suggests that tempered searches are an effective technique for iterated filtering. In 
addition, the rounds of quenching provide a sequence of parameter estimates which are useful for 
learning about the structure of the likelihood surface. 

Likelihood maximization provides not just point estimates of unknown parameters but also 
confidence intervals, either through profile likelihood calculations or Hessian approximations, and 
likelihood ratio tests of competing hypotheses. The interested reader is referred elsewhere for case 
studies demonstrating practical implementations of iterated filtering (King et al., 2008; Breto et al., 
2009; He et al., 2009; lonides et al., 2006). These practical implementations did not employ the 
increasing Monte Carlo sample size suggested by Theorem 3 and used a constant ratio OmT^n 
rather than a sequence tending to zero. Nevertheless, they were shown to be capable of maximizing 
complex likelihood surfaces to an adequate level of accuracy. Since SMC can provide an unbiased 
estimate of the likelihood function (see Corollary 1 in Section 4.3) it is relatively straightforward 
to confirm whether the likelihood has indeed been successfully maximized. 

The incorporation of iterated filtering into the framework of stochastic approximation, which 
underlies the proof of Theorem 3, suggests several avenues for further investigation. Existing mod- 
ifications of stochastic approximation techniques (Spall, 2003) include: (i) averaging parameter 
estimates across iterations; (ii) breaking down high-dimensional problems into a sequence of ran- 
domly selected lower dimensional problems; (iii) making use of a plug-and-play estimate of second 
partial derivatives. Iterated filtering, unlike generic stochastic approximation algorithms, uses the 
Monte Carlo variability in Algorithm 1 to explore the parameter space M''^ while simultaneously 
evaluating an approximation to the likelihood function and its derivative. This can lead to high 
numerical efficiency which is essential in situations which stretch available computational resources. 
For example, iterated filtering has been able to adequately identify a maximum likelihood estimate 
in a 13-dimensional space based on 50 iterations (lonides et al., 2006), a comparable computational 
burden to 50 evaluations of the likelihood function. There is undoubtedly potential to construct 
hybrid procedures which combine the strength of iterated filtering — making efficient use of few 
filtering operations to approach the maximum of the likelihood function — with the strengths of 
other methodologies. For example, a basic Kiefer-Wolfowitz algorithm (Spall, 2003) applied to an 
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unbiased SMC estimate of tlic likcliliood function would provide a sequence of estimators which 
converges to the maximum likelihood estimate with probability one, for a fixed Monte Carlo sample 
size (i.e., without the requirement — > oo in Theorem 3). 

The major challenge for likelihood-based inference in complex models is to identify a neighbor- 
hood containing those models which are plausibly consistent with the data. Once such a region 
has been identified, one then seeks to describe the likelihood surface in this neighborhood via con- 
struction of point estimates, confidence intervals and profile likelihood computations. A theoretical 
basis for this philosophy is Le Cam's quadratic estimation (Le Cam and Yang, 2000), in which 
the likelihood surface is approximated in a neighborhood of a -^/n-consistcnt estimator. Le Cam's 
ideas can be extended from quadratic approximation of the log likelihood surface to more practi- 
cally attractive smooth local likelihood approximations (lonides, 2005). These theoretical results 
highlight the statistical importance of correctly capturing the features of the likelihood on the scale 
of the uncertainty in the parameters. Smaller scale features in the likelihood surface, which may 
be a feature of the model or arise due to numerical considerations, are a distraction from effective 
inference. From this perspective, the efficient identification of plausible models which is the main 
strength of iterated filtering techniques is also the key step in model-based data analysis. 



4 Proofs of the theorems stated in Section 2 

The following proofs rely heavily on the generic probability density functions defined in Section 2. 
The reader is directed to Appendix A for a discussion on the formal use of this notation. Ad- 
ditionally, 0(r) = 0[i/j{t)) will mean that (^/■^ is bounded, and ^(r) = o(V'(t)) will mean that 
lim-j-^o <p / ip = 0. In Sections 4.1 and 4.2, we write Vg and Vg^ for vectors of partial derivatives 
with respect to the components of 9 and 9n respectively. 



4.1 A proof of Theorem 1 

Suppose inductively that \V^\ = O(r^) and — 9\ = 0{t^), which holds for n = 1 by con- 

struction. We now employ Bayes' formula, suppressing the dependence of g on 9, a and r to 
give 



n I yi:n ) 9iy n I yi:n—li On) 



n I yi:n— 1 ) I diVn I yi:n-l, 9n) 9{0n \ Vl-.n-l) d0n 
^ 9{yn\yi:n-l,On=0Ll) + (^n" Cl)' Ve„g(j/n|yi:n-l, 6'n= g^_i) +Rl 
giyn\yi:n-l,0n=eLl)+OiT^) 

= {l + {9n- 9^_JVe„ \ogg{yn \ yi:n-l,0n= Cl) 

Rl 



(12) 
(13) 



+ ^ , ,F , } X (1 + 0{r')) . (14) 

9{yn\yi:n-l,0n=9n-i)^ 

The numerator in (13) comes from a Taylor series expansion of g{yn | yi-.n-i, 9n) about 9n = 9n-i- 
Prom (Al), (A2) and (A3) there is a constant C3 such that is bounded by C^lOn — 0^_iP/2. 
This assertion is reasoned formally as Lemma 3 in Section 4.2. The denominator of (13) then 
follows from applying this expansion to the integral in (12) by observing that E[9n | yi-.n-i] ~ 



'n-l 
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and, from the induction hypothesis, E[\9n — On-i\^ I yi-.n-i] = 0{t'^). We now calculate 



" ^n-l — E[0n - On-l I Vl-n] 

= I {On-e^_i)g{en\yi:n)den (15) 

= log 5(yn I yi:n-l,en= Cl) + O^T^) (16) 

= Ve log fiVn I yi:n-l,d) + o{t^). (17) 



Equation (16) follows from (15) using (14) and (A4). Equation (17) follows from (16) via Lemma 1 
in Section 4.2 below. The inductive assumption on 9^ is then justified by (17). A similar argument 
gives 

^n+i = Var(0„+i I yi:n) = Var(6ln | yi;n) + o^Y. 

= Em,- 9f,_,){d^- Cl)' I yv.n] - « - Cl)(^n - Cl)' + (18) 

= F/ + a2E + o(T2), (19) 

where (19) follows from (18) via (14) and (17) together with the induction hypothesis on . 
Equation (19) in turn justifies this hypothesis. Multiplying (17) through by (V^)~^ and then 
summing over n gives (7), completing the proof of Theorem 1. 

4.2 Additional details on the proof of Theorem 1 

The passage from (16) to (17) may appear natural, given the smoothly parameterized sequence of 
approximations by which g approaches /. However, there is in fact some subtlety which explains 
the necessity of the two approximation parameters a and r with (jt~^ 0. If the variability of 
g{Si:n I ^0) o') ''") is small compared to the variability of g{9Q \ 9, a, r) then, heuristically, one expects 
g{9o:n-i\yi:n:(^n,(^,o;T) to be Concentrated around 9n in the limit as r — >■ 0. Lemma 1 takes 
advantage of a formalization of this limit. However, the issue may be of minor relevance in practice 
because one expects that g{9n-k:n-i \ yi:n,(^n) will indeed be concentrated around 9n when k<^n 
even if a is not small compared to r. Under typical mixing conditions, the distribution of y„ given 
yi:n-i,&0:n, <7 depends only weakly on 6'o:(n-fe-i) unless k is small. Introducing mixing conditions 
typically improves the theoretical properties of filtering procedures (e.g., Crisan and Doucet, 2002). 
We conjecture that one could achieve a result similar to Lemma 1 for a constant ratio aT~^ in a 
limit with some appropriate mixing properties, though investigating such scenarios is outside the 
scope of this article. 

Lemma 1. Suppose the conditions (A2) and (A^). In the limit as t ^ Q, 

V0„log5(j/n| 9 + CT, 9, a, r) = log /(y„ | yi:n-i, 6*) + o(l) (20) 

uniformly over < c < C2 — S for any S > 0. In particular, (20) holds for On = ^n-i- 

Proof. We suppress the dependence of g on 0, a and r. Adopting the notation that {^o:n = V"} 
means {9^ = 1^,0 < k < n}, it follows from (A2) and (A4) that 

^Og g{yn\yi:n-l,On=0 + CT) = log g{yn\ yi:n-l, Oo:n = + Ct) + O {a) 

= log f{yn\yi:n-i,0= 9 + CT) + 0{a), (21) 



9 



with the term 0{a) being uniform over sets of values of c bounded away from C2. Further details 
on this step are provided in Appendix C. We then calculate 

Ve„ log giVn I On= + cr, yi:n-l) 

= lim { log g{yn \ 9n= 9 + ct + 5, yi-.n-i) - log g{yn \ On= + cr, yi-.n-i)} 

= lim {log f{yn\yi:n-\,0 + CT + 5) -log f{yn\y\:n-l,0 + Ct) + R2} 

0— >0 

where R2 is 0(c), uniformly in 5 as long as 5 is small compared to r. Setting 5 = 5{t) with 
5t~^ — and a5~^ 0, it follows that 

^9„ log g{yn I yi:n-l, On=0 + Ct) 

= Vg log I e = e + cT) + 0{a/5) + o(l) (22) 

= Velogf{yn\yi:n-ue)+o{l). (23) 

To justify (23) it is necessary to notice that the term o(l) in (22) depends only on 5 and not on 
a or r. The uniformity of (23) over c allows one to choose c = c(r) = {9^ — 6)/t, completing the 
proof of Lemma 1. □ 

We now proceed to derive the uniform bound on Vl^g{yn\yi:n-ii9n) required to bound Ri in 
(13). To prove this result, stated as Lemma 3, we present a bound on Ve„5(yn|yi:n-i5 ^n) and 
then argue that the method can be extended to the second derivative at the expense of additional 
routine algebra. By contrast, a bound on 'Vg^g{y„]yi;r,-i,On) can be obtained more directly from 
Lemma 1, but this approach does not generalize to the second derivative unless a = o(t^). 

Lemma 2. (AI-A4) implies V e^g{yi:n, On I 0, a, r) = EILi (Pi + Vi) +Vq + Wq where 

C^i = y [Ve/(yj|a;i,6'i)]£?(j/i:i_i,yi+i:„,XO:n,^0:n|6',Cr,r)da;o:nCi6'o:n-l, 

K = / [V./(x< I x<,.„ I x<, %„, e,.,r) 

g{yi:i-l,Xi:i-i I 9o:n, 9, O", r)g{9o:n \ 9, C7, t) dxo:n d9o:n-l 
Vb = y [Vgf{xo I 9o)]g{yi;n, Xx-.n I Xq, 9o:n, 9, cr, T)g{9Q:n \ 9, cr, r) da;o:n C^^n-l, 

^0 = ^eg{yi:n,9n \ 9,C7,t). 

Proof. Integrating g{yi:n,XQ-n,9o:n\9,cr,T) over xo:n and ^o:n-i and passing Vq^ through the re- 
sulting integral gives Vg^g{yi:n, 9n \ 9, a, t) = Un + Tn for 



» n 

/n/fe 

7=1 



C7 



'j I Xj,9j)f{xj I Xj-i,9j-.i)f{xo I 6*0) X V^iK 
J-Hk (*^) X i„ (^!!LZ^) i.„,„rf,„_, (24) 



Noticing that Vej/«([^i — 9i-i]/a) = — Vej_iK([^i — 0i-i]/o") and applying integration by parts 
to (24) one finds that Ti = Vi + + Tj_i for 2 < i < n. A very similar calculation gives 
T\ = Vi + Vq + Wq, completing the proof of Lemma 2. □ 
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Lemma 3. Supposing (AI-A4), it follows that Vg 5(yn I yi:n-i, ^n, ^5 c", 7") is uniformly hounded 
for a and t in a neighborhood of zero, over compact regions of 9 with \0n — 0\ < ct for < c < C2. 

Proof. From (A1-A4) it follows that Ui, . . . , Un, Vq, . . . ,Vn and Wq in Lemma 2 are 0(r~^) and 
therefore that Vd^g{yi:n,On | 9,a,T) = 0{t^^). Suppressing dependence on 6, a and r to write 

:n— 1 ) Gn) 

^oMyn I yi:n-U0n) = [giy,.^.M? 

and observing that C4 r^^ < g{yi:n, On \ 0-, t) < C5 for some positive constants C4 and C5, we 
find Ve^g{yn \yi:n-i->(^n:(^i(^-,T) = 0{1). This argument can be routinely extended to the second 
derivative, completing the proof of Lemma 3. □ 

4.3 A proof of Theorem 2 

Let Un^rn = (^nm ~ ^n-im)/'^m ^^'^ Vn,m = ^101/''"™- Thc Corresponding Monte Carlo estimates 

of these quantities are Un,m = (^n,m ~ 0n-i,m)l'^m Vn,m = ^n,m/^m- We claim that there are 
constants Ce, . . . , Cg with 

I £^[Sn,m - Un,m] \ < Ce/Jm \E[vn,m - Vn,m] \ < C-j / Jm (25) 

E[\Un,m - Un,m?'] < Cg/Jm E[\Vn,m - Vn,mf] < Cg/Jm (26) 

uniformly for in any compact set. Previous bounds similar to (25,26) have been given for a fixed 

model as the Monte Carlo sample size JfYi incrccises, for exaniple by Del IVIoral and Jacod (2001)5 
Del Moral (2004, Section 11.8.4); Crisan and Doucet (2002). The comphcation in (25,26) is that 
the model is varying with am and Tm- However, the uniform bound on Un,m and Vn,m, together with 
the continuity of g{- \ -tU^t) as a function of a and r, is enough to show that a convergence result 
for fixed models (Crisan and Doucet, 2002, Theorem 2) applies uniformly in this context. Further 
details of this argument are deferred to Section 4.4. Carrying out a Taylor series expansion, we 
find 

where |i?3| < Cio{\un,in — itn,mP + \vn,m ~ 'Vn,m\'^) for some constant CiQ. The existence of such a 
Cio is guaranteed since the determinant of Vn,m is bounded away from zero. Taking expectations 
of both sides of (27) and applying (25,26) gives 

\E[Vn)n^n,m\ - Vm^n,m| < C'n/J^. (28) 

Another Taylor series expansion, 

'^n,m'^'n,m — '^n,m'^n,m + -^4 (29) 
with |i?4| < Ci2(|Sn,m " Un,m\ + \Vn,m " Vn,m\) implies 

) < Ci3/Jm. (30) 

Putting together (28) and (30), we see that 

E[{yn,m} (^n,m ~ ^?i-l,m)] ~ (Ki,m) (^?i,m ~ ^n-l,m) + ^ (-'-/('-'^m'^m)) 

Va'r[K™)"'(^n,m-^n-i,J] = 0{l/{alJm)). 
Theorem 2 then follows via the assumed continuity with respect to 6. 
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4.4 Justification of the sequential Monte Carlo bounds in Section 4.3 

We draw on the general theory of sequential Monte Carlo by Crisan and Doucet (2002) and 
Del Moral and Jacod (2001). For completeness, and to help the reader translate these results 
into the current context and notation, we have included the following two theorem statements. 

Theorem 4. (CrisAN and Doucet, 2002) Let h{- \ -,6) be a generic density for an unobserved 
Markov process zq-n with observations yi:jv and parameter vector 0. Define Z^- via applying 
Algorithm 1 to h{- \ •,^) with J particles. Assume that h{yn \ Zn,0) is bounded as a function of z^- 
For any (p : R'^^ — > M, denote the filtered mean of (p{zn) and its Monte Carlo estimate by 



'Pn = I (piZn)h{Zn \ yi-.n, 9) dZn, = \ f{^n,j) ■ (^1) 



There is a C14 independent of J such that 

Em-^'„f]< ^"""'f^^^^\ (32) 
Specifically, C14 can be written as a linear function of 1 and 77^,1, • • • , 7?n,n defined as 



k=n—i+l 



Theorem 5. (Del Moral and Jacod, 2001) As in Theorem 4, let h{- \ -,6) be a generic density 
for an unobserved Markov process zq-^n with observations yi;N and parameter vector 6. Let : 
M*^^ — > R 6e a bounded function, with ip^ and tp^ specified in (31). Define the un-normalized 
filtered mean (p^ and its Monte Carlo estimate ip^ by 

n n ^ J 

k=l k=l j=l 

where w{k,j) is computed in Step 4 of Algorithm 1 when evaluating (p^. Then 

ml\ = ^l. (35) 



E 



fe=i 



Corollary 1. Setting (p{z) = 1 in (34) we see that 1^ = 11^=1 ^(2/fc I ^)- Suppressing the 

dependence on yi-.N, we write 1^ = L{6), the likelihood function for h. Correspondingly, one can 
write 1^ = L{9). Lt follows from (35) that the Monte Carlo likelihood L{9) is an unbiased estimate 

ofL{e). 



Some intuition arising from Theorem 5 is that the bias in using the Monte Carlo estimate (p^ for 
(p^ arises due to the nonlinearity of the normalization procedure. From (35), we see that is an 
unbiased estimate of p^^. Defining the unit function l(x) = 1, it also follows that 1^ is an unbiased 
estimate of 1^. However, Ip^ = ^J^/l^ is generally a biased estimate of = (/?^/l^. The 
un-normalized filtered mean in Theorem 5 is not usually a quantity of direct interest (Corollary 1 
being an exception). However, Theorem 5 is necessary to justify (37) and (38) below. 
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The bound in terms of (33) was not exphcitly mentioned by Crisan and Doucet (2002) but is a 
direct consequence of their proof (see Appendix B). The uniform bound in (26) then fohows from 
the observation that rjk in (33) is continuous as a function of a in the context of Theorem 2. To show 
that (25) fohows from (26) we foUow the approach of Del Moral and Jacod (2001, Equation 3.3.14). 
Noting that = (p^/l^ and tp![ = tp^/ 1^, Theorem 5 implies the identity 

E[^^ -<p^\ = E [{^l -<p^)(l- ^) ] . (37) 
Applying the Cauchy-Schwarz inequality, together with (32) and (36), gives 

|^[?J-^ri|<0..?5E^. (38) 

The uniform bound in (25) follows from the observation that the bounding constants in (32) and 
(36) are continuous as a function of a in the context of Theorem 2. 

4.5 A proof of Theorem 3 

Theorem 3 follows directly from a general stochastic approximation result, presented as Theorem 6 
below. In the context of Theorem 3, conditions (B5) and (B6) of Theorem 6 hold from Theorem 2 
and the remaining assumptions of Theorem 6 hold by hypothesis. 

Theorem 6. Let i{9) be a continuously differentiable function M'^" — > R and let {Dm{0),m > 1} be 
a sequence of independent Monte Carlo estimators of the vector of partial derivatives V£{9). Define 
a sequence {9m} recursively by 9m+i = 9m + o-mDmiPm)- Assume (B1-B2) of Section 2 together 
with the following conditions: 

(B4) am > 0, Om ^ 0, J2m^m = OO. 

(B^) I]m«mSup|0|<r Var(L»„(6')) < oo for every r > 0. 
(B6) lim^^oo supi^K, \E[Dm{9)] - Vl{9) \ = for every r > 0. 
Then 9m converges to 9 = argmax£(0) with probability one. 

Theorem 6 is a special case of Theorem 2.3.1 of Kushner and Clark (1978). The most laborious 
step in deducing Theorem 6 from Kushner and Clark (1978) is to check that (B1-B6) imply that, 
for all e > 0, 

n+j ^ 
sup «m{^m(^m) " E[Dm{9m) | ^m] } > e] = 0, (39) 



lim P 

n— >oo 



m=n 



which in turn implies condition A2.2.4 of Kushner and Clark (1978). To show (39), we define 

= E>m{9m) - E[Dm{9m) \ Om] and 

ck _ \ Cm if \9m\ ^ k . . ^ 

" 1 if i^^i > k • 

Define processes {M;=E^t^„a„U, j > 0} and {Mf E^t^„ a^^m^i > 0} for each k and n. 

These processes are martingales with respect to the filtration defined by the Monte Carlo stochas- 
ticity. Prom the Doob-Kolmogorov martingale inequality (e.g., Grimmett and Stirzaker, 1992), 

P[sup|M;'^| > e] < ^ ^ sup Y^T{Dm{9)). (41) 

3 ^ m=n \d\<k 
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Define events F„ = {sup^ |M"| > e} and Fn^k = {supj |M"'*^| > e}. It follows from (B5) and (41) 
that lim„^oo P{Fn,k} = for each k. In light of the non- divergence assumed in (B2), this implies 
lim^^oo -P{-^n} = which is exactly (39). 

To expand on this final assertion, let ft = {sup^|^m| < oo} and $7^ = {sup„|6'm| < k}. 
Assumption (B2) implies that P{0,) = 1. Since the sequence of events {f^fc} is increasing up to fi, 
we have linife^oo P{^k) = Pi^) = 1- Now observe that Qk H Fnj = 17^ n for all j > k, as there 
is no truncation of the sequence {^m, m = 1,2, . . .} for outcomes in ilk when j > k. Then, 

lim P[Fn] < lim P[F„ n Ofe] + 1 - P[J7fc] 

n—^oo n— ►oo 

= lim P[F„,fe n Ofe] + 1 - P[i}k] 

n— >oo 

< lim P[Fn,k] + 1 - Ppk] 

n— >oo 

= l-Pl^k]- 

Since k can be chosen to make 1 — Pf^i^] arbitrarily small, it follows that lim^^oo -P[-^n] = 0. 



A Formalizing the use of generic functions via overloading 

A partially observed Markov model was specified by a generic probability density functions g{- \ •). 
This was used to describe all joint and conditional distributions, with the argument to g{- \ •) 
specifying the density in question. For example, 

9{yn\yi:n-l,0n) (42) 

gives the distribution of given yi-.n-i and On- This notation has advantages that it is self- 
documenting and one does not have to define additional notation for the density of every new 
quantity that is brought into consideration. Ensuring that the notation results in correct mathe- 
matics requires some amount of care for the following two reasons: (i) we do not distinguish between 
random variables and their realizations; (ii) the meaning of a function should not usually depend 
on the name of the argument supplied, so f{x) and f{y) should correspond to the same function / 
evaluated at x or y respectively. 

An equally adaptable notation, which is more explicit but cumbersome to the point of being 
unusable, involves rewriting (42) as 

9Yr,\Yi:n-ue„iyn\yi:n-l,0n)- (43) 

The map from (42) onto (43) is an instance of function overloading, which for the current purposes 
is synonymous with the technical term function polymorphism in computer languages (Strachey, 
2000). To define a polymorphic function, which takes different functional forms depending on 
the arguments, one must label the arguments with types. We suppose that y„ has type Yn, and 
similarly On has type O^, etcetera. The overloaded function g{- \ •) in (42) then looks to the type of 
its arguments when it is evaluated via (43). Suppose that we wish to evaluate (42) at 0^ = On + en- 
We can achieve this by writing yiyn 

I yi;n-i-,On) if it is considered clear that ^* shoTild possess the 
same type as On, namely 0„. We also interpret g{yn \yi:n-i^On=0^ as an explicit instruction to 
evaluate g{yn \ yi-.n-i, On) with ^* being assigned the type of On- 

The arguments for and against overloading in (42) are essentially the same as those in com- 
puter languages. Overloading has a potential for conceptual clarity and conciseness which must 
be weighed against the potential cost in terms of errors arising from incorrect applications of the 
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typing rules. Overloading is fundamental to object oriented computer programming, and notation 
such as (42) has similarly been found useful for working with dynamic models. The formalization 
of (42) in terms of function overloading enables more confident use of this convenient notation. 



B Some additional details on Theorem 4 

Here, we seek to support our assertion in the statement of Theorem 4 that the proof in Crisan and 
Doucet (2002) implies the constant C14 in equation (32) can be written as a linear function of 1 
and 77„,i, . . . , rin,n defined as 



Hyk\yi:k-i,d) 



Below, we use the notation of Crisan and Doucet (2002) and a reader wishing to follow our argument 
is advised to have a copy of this article at hand. Crisan and Doucet (2002, Section V) introduced 
the following recursion: 



(VC+^^y (44) 



'n\n 

Cn\n-l = (1 + ^JCn-l\n-l) (46) 

where ||/i||n = sup^^ /i(y„|z„,^). Here, C is a constant that depends on the resampling procedure 
but not on the number of particles J. The constant C14 in equation (32) corresponds to c„|„. Now, 
(44-46) can be reformulated by routine algebra as 

(47) 
(48) 
(49) 

where g„ = ||/i||^ [/i(|/„|yi:„_i, 6*)] ^ and Ki, . . . are constants which do not depend on h, 9, 
yi;N or J. Putting (48) and (49) into (47), 

< Ki+ K2KsqnCn\n-l 

< Ki+K2K^Kiqn + K2K^K5qnCn-i\n-i- (50) 
Since 'qn,i = 9n??n-i,« for i < n, and r)n,n = Qn, the required assertion follows from (50). 

C Some additional details on the proof of Lemma 1 

We wish to show the validity of the assertion that 

^Og g{yn\yi:n-l,On= 9 + Ct) = log g{yn\ yi:n-l, do:n= & + Ct) + 0(cr). (51) 

We will instead show that 

giVn I yi:n-l,dn= + Ct) = g{yn \ Vl-.n-U 0G:n= + Ct) + 0{a), (52) 
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from which (51) follows since g is bounded away from zero on compact sets. Begin by defining 

I = giUn I yi:n-l, dn= + Ct) 

= J 9iyn\yi:n-l,0o:n-l,dn=0 + ct) g{9o:n-l\0n=0 + CT,yi:n) d9o:n-l- (53) 

Now a Taylor expansion gives 

g{yn I yi:n-l, do:n-l, dn= + Ct) = A + B , (54) 

where 

A = g{yn\yi:n-l,Oo:n=d + Ct) 

B = {9o:n-l - 9- Ct)'V eo:.a-l9{yn \ yi:n-l, 9n= 9 + CT, 9l,^_i) 

for some ^'o-n-i satisfying 

9G:n-l < 0O:n-l < + CT. (55) 

Here, it is understood that ^oin-i — + ct means 9^ < 9 + ct componentwise for each < A; < n— 1. 
Putting (54) into (53), 

I = j {A + B)g{9o..n-l\9n=9 + CT,yi..n)d9Q..n-l. 

Since A does not depend on ^o:n-i) and ^(.l.) is a density, we find that 

I = A + J Bg{9o:n-l\9n=9 + CT,yi.,n)d9o:n-l. (56) 

To show (51), it therefore suffices to argue that the second term of (56) is 0(cr). For ^o-n-i restricted 
to a compact subset of M.'^^ , (A2) guarantees the existence of a Cig such that 

\'^eo:n-l9{yn\yi:n-l,dn=d + CT,9o,n_l,Xi:n)\ < CiQ. 

Therefore, 

Veo:„-i5(yn|yi:n-l, 9n = 9 + CT, 6*0:71-1) 

= j V(?o^„_i5f(?/n I yi:n-l,On= 9+CT, 9l,n_i, Xi;n) g{xi;n \ yi:n-l,dn= 9+CT, 9l,n_i) dxi;n 

< Cie (57) 
since g{-\ ) integrates to 1. Using (57), we expand the second term of (56) to give 

J Bg{9o:n-l\9n=9 + CT,yi.,n)d9o:n_i 

= J {d0:n-l -0- CT)'Veo:n-l9{yn \ J/l:n-l, 6n= 9+CT, 6'o:n-l)5(6'o:n-l | dn= 9+CT, yi;n) d9Q:n-l 

< C16 J i9o:n-l - 9 - CTyg{9o:n-l\9n = 9 + CT, yi;n) d9o:n-l 

= Ci6{£;[^0:n-l|^n = 9 + Ct] - 9 - Ct} (58) 

= 0{a). (59) 
(59) follows from (58) by (A4), completing our demonstration of (51). 
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